Properties of Classical and Quantum Jensen-Shannon Divergence 



Jop Briët and Peter Harremoës
Centrum Wiskunde & Informatica, Science Park 123, 1098 XG Amsterdam, The Netherlands 

(Dated: April 14, 2009) 

Jensen-Shannon divergence (JD) is a symmetrized and smoothed version of the most important
divergence measure of information theory, Kullback divergence. As opposed to Kullback divergence
it determines in a very direct way a metric; indeed, it is the square of a metric. We consider a family
of divergence measures ($\mathrm{JD}_\alpha$ for $\alpha > 0$), the Jensen divergences of order $\alpha$, which generalize JD as
$\mathrm{JD}_1 = \mathrm{JD}$. Using a result of Schoenberg, we prove that $\mathrm{JD}_\alpha$ is the square of a metric for $\alpha \in (0, 2]$,
and that the resulting metric space of probability distributions can be isometrically embedded in
a real Hilbert space. Quantum Jensen-Shannon divergence (QJD) is a symmetrized and smoothed
version of quantum relative entropy and can be extended to a family of quantum Jensen divergences
of order $\alpha$ ($\mathrm{QJD}_\alpha$). We strengthen results by Lamberti et al. by proving that for qubits and pure
states, $\mathrm{QJD}_\alpha^{1/2}$ defines a metric space which can be isometrically embedded in a real Hilbert space when
$\alpha \in (0,2]$. In analogy with Burbea and Rao's generalization of JD, we also define general QJD by
associating a Jensen-type quantity to any weighted family of states. Appropriate interpretations of
quantities introduced are discussed and bounds are derived in terms of the total variation and trace
distance.

PACS numbers: 89.70.Cf, 03.67.-a 



I. INTRODUCTION 



For two probability distributions $P = (p_1, \dots, p_n)$ and
$Q = (q_1, \dots, q_n)$ on a finite alphabet of size $n \ge 2$,
Jensen-Shannon divergence (JD) is a measure of divergence
between $P$ and $Q$. It measures the deviation between
the Shannon entropy of the mixture $(P+Q)/2$ and
the mixture of the entropies, and is given by

$$\mathrm{JD}(P,Q) = H\left(\frac{P+Q}{2}\right) - \frac{1}{2}\bigl(H(P) + H(Q)\bigr). \tag{1}$$
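
As a quick numerical illustration of Eq. (1), the following minimal sketch computes JD in nats for small distributions. The function names and the three test distributions are assumptions made for this example, not taken from the paper.

```python
import math

def shannon_entropy(p):
    """Shannon entropy H(P) in nats; zero-probability terms contribute 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def jensen_shannon(p, q):
    """JD(P, Q) = H((P + Q)/2) - (H(P) + H(Q))/2, following Eq. (1)."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    return shannon_entropy(mixture) - (shannon_entropy(p) + shannon_entropy(q)) / 2

def root_jd(p, q):
    """Square root of JD; max() guards against tiny negative rounding errors."""
    return math.sqrt(max(jensen_shannon(p, q), 0.0))

# Illustrative distributions (chosen arbitrarily for this sketch).
P = [1.0, 0.0, 0.0]
Q = [0.0, 1.0, 0.0]
R = [0.2, 0.3, 0.5]

print(jensen_shannon(P, Q))                          # disjoint supports give ln 2
print(root_jd(P, Q) <= root_jd(P, R) + root_jd(R, Q))  # triangle inequality for the square root
```

The example only spot-checks symmetry, boundedness and one triangle inequality; it is not a proof of the metric property.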

Attractive features of this function are that it is everywhere
defined, bounded, symmetric and only vanishes
when $P = Q$. Endres and Schindelin [1] proved that it
is the square of a metric, which we call the transmission
metric ($d_T$). This result implies, for example, that Banach's
fixed point theorem holds for the space of probability
distributions endowed with the metric $d_T$. A natural
way to extend Jensen-Shannon divergence is to consider
a mixture of $k$ probability distributions $P_1, \dots, P_k$, with
weights $\pi_1, \dots, \pi_k$, respectively. With $\pi = (\pi_1, \dots, \pi_k)$,
we can then define the general Jensen divergence as

$$\mathrm{JD}^\pi(P_1, \dots, P_k) = H\left(\sum_{i=1}^k \pi_i P_i\right) - \sum_{i=1}^k \pi_i H(P_i).$$

This was already considered by Gallager [2] in 1968, who
proved that, for fixed $\pi$, this is a convex function in
$(P_1, \dots, P_k)$. Further identities and inequalities were derived
by Lin and Wong [3] [4], and Topsøe [5]. It has found
a variety of important applications: Sibson [6] showed
that it has applications in biology and cluster analysis, 
Wong and You [7] used it as a measure of distance be- 
tween random graphs, and recently, Rosso et al. used it 
to quantify the deterministic vs. the stochastic part of a 
time series [8]. For its statistical applications we refer to 
El-Yaniv et al. [9] and references therein. 

Burbea and Rao [10] introduced another level of generalization,
based on more general entropy functions. For
an interval $I$ in $\mathbb{R}$ and a function $\phi : I \to \mathbb{R}$, they define
the $\phi$-entropy of $x \in I^n$ (where $I^n$ denotes the Cartesian
product of $n$ copies of $I$) as

$$H_\phi(x) = -\sum_{i=1}^n \phi(x_i).$$

Based on this, they define the generalized mutual information
measure as

$$\mathrm{JD}^\pi_\phi(P_1, \dots, P_k) = H_\phi\left(\sum_{i=1}^k \pi_i P_i\right) - \sum_{i=1}^k \pi_i H_\phi(P_i),$$

for which they established some strong convexity properties.
If $k = 2$, $I = [0,1]$ and $\phi$ is the function
$x \mapsto (\alpha - 1)^{-1}(x^\alpha - x)$, then $H_\phi$ defines the entropy of order
$\alpha$. In this case, Burbea and Rao proved that $\mathrm{JD}^\pi_\phi$
is convex for all $n$ if and only if $\alpha \in [1,2]$, except if
$n = 2$, when convexity holds if and only if $\alpha \in [1,2]$ or
$\alpha \in [3, 11/3]$.

We focus on the functions $\mathrm{JD}^\pi_\phi$, where $k \ge 2$, $I = [0, 1]$
and $\phi$ defines entropy of order $\alpha$. For ease of notation we
write these as $\mathrm{JD}^\pi_\alpha$ if $k \ge 2$ and as $\mathrm{JD}_\alpha$ if $k = 2$ and
$\pi = (1/2, 1/2)$.

Shannon entropy is additive in the sense that the entropy
of independent random variables, defined as the
entropy of their joint distribution, is the sum of their
individual entropies. Like Shannon entropy, Rényi entropy of
order $\alpha$ is additive, but in general Rényi entropy is
not concave. The power entropy of order $\alpha$ is a monotone
function of Rényi entropy but, contrary to Rényi entropy,
it is a concave function, which is what we are interested
in. The study of power entropy dates back to J.H.
Havrda and F. Charvát [27]. Since then it was rediscovered
independently several times [TT] (T3] [SB], but we have
chosen the more neutral term entropy of order $\alpha$ rather
than calling it Havrda-Charvát-Lindhardt-Nielsen-Aczél-Daróczy-Tsallis
entropy. Entropy of order $\alpha$ is not additive
(unless $\alpha = 1$). This is one of the reasons why
this function is used by physicists in attempts to model
long range interaction in statistical mechanics, cf. Tsallis
and followers (can be traced from a bibliography
maintained by Tsallis).

Martins et al. [T3J [TT] [T5J [T5] give non-extensive (i.e. 
non-additive) generalizations of JD based on entropies 
of order a and an extension of the concept of convexity 
to what they call q-convexity. For these functions they 
extend Burbea and Rao's results in terms of q-convexity.

Distance measures between quantum states, which
generalize probability distributions, are of great interest
to the field of quantum information theory [17] [18]
[19] [20] [21]. They play a central role in state discrimination
and in quantifying entanglement. For example, the
quantum relative entropy of two states $\rho_1$ and $\rho_2$, given
by $S(\rho_1\|\rho_2) = \mathrm{Tr}\,\rho_1(\ln\rho_1 - \ln\rho_2)$, is a commonly used
distance measure. (For a review of its basic properties
and applications see [52].) However, it is not symmetric
and does not obey the triangle inequality. As an alternative,
Lamberti et al. [ST] [53] [21] proposed to use the
(classical) JD as a distance function for quantum states,
but also introduced a quantum version based on the von
Neumann entropy, which we denote by QJD. Like its
classical variant, it is everywhere defined, bounded, symmetric
and zero only when the inputs are two identical
quantum states. They prove that it is a metric on the
set of pure quantum states and that it is close to
Wootters' distance and its generalization introduced by
Braunstein and Caves [TB]. Whether the metric property
holds in general is unknown.

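As an analogue to $\mathrm{JD}^\pi_\alpha$ for quantum states, we introduce
the general quantum Jensen divergence of order $\alpha$
($\mathrm{QJD}^\pi_\alpha$). In the limit $\alpha \to 1$ we obtain the "von Neumann
version":

$$\mathrm{QJD}^\pi(\rho_1, \dots, \rho_k) = S\left(\sum_{i=1}^k \pi_i \rho_i\right) - \sum_{i=1}^k \pi_i S(\rho_i),$$

where $S(\rho) = -\mathrm{Tr}\,\rho\ln\rho$ is the von Neumann entropy. For
$k = 2$ and $\pi = (1/2, 1/2)$ one obtains the quantum Jensen
divergence of order $\alpha$ ($\mathrm{QJD}_\alpha$), which generalizes QJD as
$\lim_{\alpha \to 1} \mathrm{QJD}_\alpha = \mathrm{QJD}$.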



1. Our results. 

We extend the results of Endres and Schindelin, concerning
the metric property of JD, and those of Lamberti
et al., concerning the metric property of QJD, as follows:

• Denoting the set of probability distributions on a
set $X$ by $M_+^1(X)$, we prove that for $\alpha \in (0,2]$,
the pair $(M_+^1(X), \mathrm{JD}_\alpha^{1/2})$ is a metric space which
can be isometrically embedded in a real separable
Hilbert space.

• Denoting the set of quantum states on qubits (2-dimensional
Hilbert spaces) by $B_+^1(\mathcal{H}_2)$ and the set
of pure-states on $d$-dimensional Hilbert spaces by
$P(\mathcal{H}_d)$, we prove that for $\alpha \in (0,2]$, the pairs
$(B_+^1(\mathcal{H}_2), \mathrm{QJD}_\alpha^{1/2})$ and $(P(\mathcal{H}_d), \mathrm{QJD}_\alpha^{1/2})$ are metric
spaces which can be isometrically embedded in
a real separable Hilbert space.

• We show that these results do not extend to the
cases $\alpha \in (2,3)$ and $\alpha \in (7/2,\infty)$. More precisely,
we show that, for $\alpha \in (2,3)$, neither $\mathrm{JD}_\alpha$
nor $\mathrm{QJD}_\alpha$ can be the square of a metric, and for
$\alpha \in (7/2,\infty)$, isometric embedding in a real Hilbert
space is impossible (though the metric property
may still hold).



2. Techniques. 

To prove our positive results, we invoke a theorem by
Schoenberg which links Hilbert space-embeddability of a
metric space $(X, d)$ to the property of negative definiteness
(defined in Section IV). We prove that for $\alpha \in (0, 2]$,
$\mathrm{JD}_\alpha$ satisfies this condition for every set of probability
distributions, and that $\mathrm{QJD}_\alpha$ satisfies this condition for
every set of qubits or pure-states.



A. Interpretations of $\mathrm{JD}^\pi$ and $\mathrm{QJD}^\pi$
1. Channel capacity. 

A discrete memoryless channel is a system with input
and output alphabets $X$ and $Y$ respectively, and conditional
probabilities $p(y|x)$ for the probability that $y \in Y$
is received when $x \in X$ is sent. For a discrete memoryless
channel with $|X| = k$, input distribution $\pi$ over $X$ and
conditional distributions $P_x(y) = p(y|x)$, we have that
$\mathrm{JD}^\pi_1(P_{x_1}, \dots, P_{x_k})$ in fact gives the transmission rate.
(See for example [25].) Inspired by this fact, we call the
metric defined by the square root of JD the transmission
metric and denote it by $d_T$.

A quantum channel has classical input alphabet $X$,
and an encoding of every element $x \in X$ into a quantum
state $\rho_x$. A receiver decodes a message by performing a
measurement with $|Y|$ possible outcomes on the state he
or she obtained. For a quantum channel with $|X| = k$,
input distribution $\pi$ over $X$, and encoded elements $\rho_x$,
Holevo's Theorem [29] says that the maximum transmission
rate of classical information (the classical channel
capacity) is at most $\mathrm{QJD}^\pi_1(\rho_{x_1}, \dots, \rho_{x_k})$. Holevo [30],
and Schumacher and Westmoreland [31], proved that this
bound is also asymptotically achievable.



2. Data compression and side information. 

Let $X = [k]$ be an input alphabet and for each $i \in X$ let
$P_i$ be a distribution over output alphabet $Y$ with $|Y| = n$.
Consider a setting where a sender uses a weighting $\pi$ over
$X$, and a receiver who has to compress the received output
data losslessly. We call the receiver's knowledge of
which distribution $P_i$ is used at any time the side information,
and the difference between the average number
of nats (units based on the natural logarithm instead of
bits) used for the encoding when the side information is
not known, and when it is known, the redundancy. In
[52], this setting is referred to as the switching model.

If the receiver always knows which input distribution
is used, then for each distribution $P_i$, he or she can apply
the optimal compression encoding, which uses $H(P_i)$ nats on average.
Hence, if the receiver has access to the side information,
the average number of nats that the optimal compression
encoding uses is given by $\sum_{i=1}^k \pi_i H(P_i)$.

However, if the receiver does not know when which
input distribution is used, he or she always has to use
the same encoding. We say that a compression encoding
$C$ corresponds to an input distribution $Q$ if $C$ is optimal
for $Q$ (i.e., the number of nats used is $H(Q)$). If the
sender transmits an infinite sequence of letters $y_1 y_2 \cdots$,
picked according to distribution $P_i$, and the receiver
compresses it using an encoding $C$ which corresponds to
distribution $Q$, then the average number of used nats is
given by $\sum_{j=1}^n P_i(y_j)\ln\frac{1}{Q(y_j)}$.

Hence, with the weighting $\pi_1, \dots, \pi_k$, we get the redundancy

$$R(Q) := \sum_{i=1}^k \pi_i\left(\sum_{j=1}^n P_i(y_j)\ln\frac{1}{Q(y_j)} - H(P_i)\right) = \sum_{i=1}^k \pi_i D(P_i\|Q),$$

a weighted average of Kullback divergences between the
$P_i$'s and $Q$. The compensation identity states that for
$P = \sum_{i=1}^k \pi_i P_i$, the equality

$$\sum_{i=1}^k \pi_i D(P_i\|Q) = \sum_{i=1}^k \pi_i D(P_i\|P) + D(P\|Q) \tag{2}$$

holds for any distribution $Q$, cf. [33] [34].



Since $D(P\|Q) \ge 0$ with equality only for $Q = P$, it follows immediately that $Q = P$ is the unique
argmin-distribution for $R(Q)$, and that $\mathrm{JD}^\pi(P_1, \dots, P_k)$
is the corresponding minimum value.
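
The compensation identity and the resulting minimum can be checked numerically. The sketch below is only an illustration: the weights, the three distributions and the coding distribution `q` are arbitrary choices for this example, and the helper names are hypothetical.

```python
import math

def shannon_entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kullback(p, q):
    """Kullback divergence D(P||Q) in nats (assumes q > 0 wherever p > 0)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def mixture(weights, dists):
    """The mixture P = sum_i pi_i P_i."""
    n = len(dists[0])
    return [sum(w * d[j] for w, d in zip(weights, dists)) for j in range(n)]

def redundancy(weights, dists, q):
    """R(Q) = sum_i pi_i D(P_i||Q)."""
    return sum(w * kullback(d, q) for w, d in zip(weights, dists))

def general_jd(weights, dists):
    """JD^pi(P_1,...,P_k) = H(sum_i pi_i P_i) - sum_i pi_i H(P_i)."""
    p_bar = mixture(weights, dists)
    return shannon_entropy(p_bar) - sum(w * shannon_entropy(d) for w, d in zip(weights, dists))

weights = [0.5, 0.3, 0.2]
dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
q = [0.25, 0.25, 0.5]

p_bar = mixture(weights, dists)
lhs = redundancy(weights, dists, q)
rhs = redundancy(weights, dists, p_bar) + kullback(p_bar, q)
print(lhs, rhs)                                  # both sides of Eq. (2)
print(redundancy(weights, dists, p_bar), general_jd(weights, dists))  # minimum redundancy vs JD^pi
```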

Analogously in a quantum setting, let $X = [k]$ be an
input alphabet, and for each $i \in X$ let $\rho_i$ be a state on an
output Hilbert space $\mathcal{H}_Y$. We can think of a sender who
uses the weighting $\pi$ over $X$, but a receiver
who has to compress the states on $\mathcal{H}_Y$ using as few qubits
as possible.

Schumacher [35] showed that the mean number of
qubits necessary to encode a state $\rho_i$ is given by $S(\rho_i)$.
Later, Schumacher and Westmoreland [36] introduced a
quantum encoding scheme, in which an encoding $C_\sigma$ that
is optimal (i.e., requires the least number of qubits) for a
state $\sigma$ requires on average $S(\rho_i) + S(\rho_i\|\sigma)$ qubits to encode
$\rho_i$. Hence, when the receiver uses $C_\sigma$ as the encoding,
the mean redundancy is $R(\sigma) := \sum_{i=1}^k \pi_i S(\rho_i\|\sigma)$.
Let $\rho = \sum_{i=1}^k \pi_i \rho_i$. The quantum analogue of (2) is
given by Donald's identity [57]:

$$\sum_{i=1}^k \pi_i S(\rho_i\|\sigma) = \sum_{i=1}^k \pi_i S(\rho_i\|\rho) + S(\rho\|\sigma),$$

from which it follows that $\sigma = \rho$ is the argmin-state that
the receiver should code for, and that $\mathrm{QJD}^\pi(\rho_1, \dots, \rho_k)$
is the minimum redundancy.



II. PRELIMINARIES AND NOTATION 

In this section we fix notation to be used throughout 
the paper. We also provide a concise overview of those 
concepts from quantum theory which we need. For an 
extensive introduction we refer to [38].



A. Classical information theoretic quantities 

We write $[n]$ for the set $\{1, 2, \dots, n\}$. The set of probability
distributions supported by $\mathbb{N}$ is denoted by $M_+^1(\mathbb{N})$
and the set supported by $[n]$ is denoted by $M_+^1(n)$. We
associate with probability distributions $P, Q \in M_+^1(n)$
point probabilities $(p_1, \dots, p_n)$ and $(q_1, \dots, q_n)$, respectively.
Entropy of order $\alpha \ne 1$, Shannon entropy and
Kullback divergence are given by

$$S_\alpha(P) := \frac{1 - \sum_{i=1}^n p_i^\alpha}{\alpha - 1},$$

$$H(P) := -\sum_{i=1}^n p_i \ln p_i \tag{3}$$

and

$$D(P\|Q) := \sum_{i=1}^n p_i \ln\frac{p_i}{q_i}, \tag{4}$$

respectively. Note that $\lim_{\alpha \to 1^+} S_\alpha(P) = H(P)$. For
two-point probability distributions $P = (p, 1 - p)$ we let
$s_\alpha(p)$ denote $S_\alpha(p, 1 - p)$.
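
The following minimal sketch evaluates entropy of order $\alpha$ and illustrates two facts used in the Introduction: it approaches Shannon entropy as $\alpha \to 1$, and it is not additive for $\alpha \ne 1$. The pseudo-additivity relation checked in the code, $S_\alpha(P \times Q) = S_\alpha(P) + S_\alpha(Q) - (\alpha - 1) S_\alpha(P) S_\alpha(Q)$, is a standard identity that follows directly from the definition; it is not stated in the text above, and the function names and example distributions are assumptions of this sketch.

```python
import math

def entropy_of_order(p, alpha):
    """S_alpha(P) = (1 - sum_i p_i^alpha) / (alpha - 1); Shannon entropy at alpha = 1."""
    if alpha == 1:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (1 - sum(x ** alpha for x in p if x > 0)) / (alpha - 1)

def product_distribution(p, q):
    """Joint distribution of two independent random variables."""
    return [a * b for a in p for b in q]

P = [0.5, 0.3, 0.2]
Q = [0.6, 0.4]
alpha = 1.5

joint = entropy_of_order(product_distribution(P, Q), alpha)
s_p, s_q = entropy_of_order(P, alpha), entropy_of_order(Q, alpha)

print(joint, s_p + s_q)                             # not equal: no additivity for alpha != 1
print(joint, s_p + s_q - (alpha - 1) * s_p * s_q)   # equal: pseudo-additivity
print(entropy_of_order(P, 1 + 1e-6), entropy_of_order(P, 1))  # close: limit alpha -> 1
```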



B. Quantum theory 

1. States. 

The $d$-dimensional complex Hilbert space, denoted by
$\mathcal{H}_d$, is the space composed of all $d$-dimensional complex
vectors, endowed with the standard inner product.
A physical system is mathematically represented by a
Hilbert space. Our knowledge about a physical system
is expressed by its state, which in turn is represented
by a density matrix (a trace-1 positive matrix) acting
on the Hilbert space. The set of density matrices on a
Hilbert space $\mathcal{H}$ is denoted by $B_+^1(\mathcal{H})$ [51]. Rank-1 density
matrices are called pure-states. Systems described by
two-dimensional Hilbert spaces are called qubits. As the
eigenvalues of a density matrix are always nonnegative real
numbers that sum to one, a state can be interpreted as a
probability distribution over pure-states. Hence, sets of
states with a complete set of common eigenvectors can
be interpreted as probability distributions on the same
set of pure-states. States thus generalize probability distributions.
This interpretation is not possible when a
common basis does not exist. Two states $\rho$ and $\sigma$ have a
complete set of common eigenvectors if and only if they commute,
i.e. $\rho\sigma = \sigma\rho$.



2. Measurements. 

Information about a physical system can be obtained
by performing a measurement on its state. The most
general measurement with $k$ outcomes is described by $k$
positive matrices $A_1, \dots, A_k$, which satisfy $\sum_{i=1}^k A_i = I$.
This is a special case of the more general concept of
a positive operator valued measure (POVM, see for example
[38]). The probability that a measurement $A$ of
a system in state $\rho$ yields the $i$'th outcome is $\mathrm{Tr}(A_i\rho)$.
Hence, the measurement yields a random variable $A(\rho)$
with $\Pr[A(\rho) = \lambda_i] = \mathrm{Tr}(A_i\rho)$. Naturally, the measurement
operators and quantum states should act on the
same Hilbert space.



C. Quantum information theoretic quantities 

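For states $\rho, \sigma \in B_+^1(\mathcal{H})$, we use the quantum version
of entropy of order $\alpha$, von Neumann entropy and quantum
relative entropy, given by

$$S_\alpha(\rho) := \frac{1 - \mathrm{Tr}(\rho^\alpha)}{\alpha - 1},$$

$$S(\rho) := -\mathrm{Tr}(\rho\ln\rho) \tag{5}$$

and

$$S(\rho\|\sigma) := \mathrm{Tr}\,\rho\ln\rho - \mathrm{Tr}\,\rho\ln\sigma, \tag{6}$$

respectively. Note that $\lim_{\alpha \to 1^+} S_\alpha(\rho) = S(\rho)$. We refer
to [39] for a discussion of quantum relative entropy.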



III. DIVERGENCE MEASURES 

A. The general Jensen divergence 

Let us consider a mixture of $k$ probability distributions
$P_1, \dots, P_k$ with weights $\pi_1, \dots, \pi_k$ and let $P =
\sum_{i=1}^k \pi_i P_i$. Jensen's inequality and concavity of Shannon
entropy imply that

$$H\left(\sum_{i=1}^k \pi_i P_i\right) \ge \sum_{i=1}^k \pi_i H(P_i).$$

When entropies are finite, we can subtract the right-hand
side from the left-hand side and use this as a measure of
how much Shannon entropy deviates from being affine.
This difference is called the general Jensen-Shannon divergence
and we denote it by $\mathrm{JD}^\pi(P_1, \dots, P_k)$, where
$\pi = (\pi_1, \dots, \pi_k)$. One finds that

$$H\left(\sum_{i=1}^k \pi_i P_i\right) - \sum_{i=1}^k \pi_i H(P_i) = \sum_{i=1}^k \pi_i D(P_i\|P) \tag{7}$$

and therefore

$$\mathrm{JD}^\pi(P_1, \dots, P_k) = \sum_{i=1}^k \pi_i D(P_i\|P). \tag{8}$$

In the general case when entropies may be infinite the last
expression can be used, but we will focus on the situation
where the distributions are over a finite set and in this
case we can use the left-hand side of (7).

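Jensen divergence of order $\alpha$ is defined by the formula

$$\mathrm{JD}^\pi_\alpha(P_1, \dots, P_k) = S_\alpha\left(\sum_{i=1}^k \pi_i P_i\right) - \sum_{i=1}^k \pi_i S_\alpha(P_i).$$

Similarly, if $\rho_1, \dots, \rho_k$ are states on a Hilbert space we
define

$$\mathrm{QJD}^\pi(\rho_1, \dots, \rho_k) = \sum_{i=1}^k \pi_i S(\rho_i\|\rho), \tag{9}$$

where $\rho = \sum_{i=1}^k \pi_i \rho_i$. For states on a finite dimensional
Hilbert space we have

$$\mathrm{QJD}^\pi(\rho_1, \dots, \rho_k) = S\left(\sum_{i=1}^k \pi_i \rho_i\right) - \sum_{i=1}^k \pi_i S(\rho_i).$$

The quantum Jensen divergence of order $\alpha$ is defined by

$$\mathrm{QJD}^\pi_\alpha(\rho_1, \dots, \rho_k) = S_\alpha\left(\sum_{i=1}^k \pi_i \rho_i\right) - \sum_{i=1}^k \pi_i S_\alpha(\rho_i).$$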



B. The Jensen divergence 

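For even mixtures of two distributions, we introduce
the notation $\mathrm{JD}_\alpha(P, Q)$ for $\mathrm{JD}^{(1/2,1/2)}_\alpha(P, Q)$. That is,

$$\mathrm{JD}_\alpha(P,Q) := S_\alpha\left(\frac{P+Q}{2}\right) - \frac{1}{2}S_\alpha(P) - \frac{1}{2}S_\alpha(Q). \tag{10}$$

For even mixtures of two states the QJD was defined
in [53], to which we refer for some of its basic properties.
We consider the order $\alpha$ version of this and write
$\mathrm{QJD}_\alpha(\rho, \sigma)$ for $\mathrm{QJD}^{(1/2,1/2)}_\alpha(\rho, \sigma)$. That is,

$$\mathrm{QJD}_\alpha(\rho,\sigma) := S_\alpha\left(\frac{\rho+\sigma}{2}\right) - \frac{1}{2}S_\alpha(\rho) - \frac{1}{2}S_\alpha(\sigma). \tag{11}$$

We refer to (10) and (11) simply as Jensen divergence
of order $\alpha$ ($\mathrm{JD}_\alpha$) and quantum Jensen divergence of order
$\alpha$ ($\mathrm{QJD}_\alpha$) respectively.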
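
For qubits, Eq. (11) can be evaluated without any matrix library. The sketch below assumes the standard Bloch-vector parametrization $\rho = (I + r\cdot\sigma)/2$ with $|r| \le 1$, which is not used in the text above: the eigenvalues of $\rho$ are then $(1 \pm |r|)/2$, and the even mixture of two states has Bloch vector $(r_1 + r_2)/2$. All names and example states are illustrative.

```python
import math

def qubit_eigenvalues(r):
    """Eigenvalues of rho = (I + r . sigma)/2 for a Bloch vector r with |r| <= 1."""
    norm = math.sqrt(sum(x * x for x in r))
    return ((1 + norm) / 2, (1 - norm) / 2)

def quantum_entropy_of_order(r, alpha):
    """S_alpha(rho) = (1 - Tr rho^alpha)/(alpha - 1); von Neumann entropy at alpha = 1."""
    eigenvalues = [x for x in qubit_eigenvalues(r) if x > 0]
    if alpha == 1:
        return -sum(x * math.log(x) for x in eigenvalues)
    return (1 - sum(x ** alpha for x in eigenvalues)) / (alpha - 1)

def qjd(r1, r2, alpha):
    """QJD_alpha(rho, sigma) as in Eq. (11), for qubits given by Bloch vectors."""
    mid = [(a + b) / 2 for a, b in zip(r1, r2)]
    return (quantum_entropy_of_order(mid, alpha)
            - quantum_entropy_of_order(r1, alpha) / 2
            - quantum_entropy_of_order(r2, alpha) / 2)

up, down = (0.0, 0.0, 1.0), (0.0, 0.0, -1.0)   # two orthogonal pure states
mixed = (0.3, 0.0, 0.4)                          # a mixed state with |r| = 0.5

print(qjd(up, down, 1))   # ln 2 for orthogonal pure states
for a in (0.5, 1, 1.5, 2):
    # triangle inequality for the square root, spot-checked on one triple
    print(a, math.sqrt(qjd(up, down, a)) <= math.sqrt(qjd(up, mixed, a)) + math.sqrt(qjd(mixed, down, a)))
```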



IV. METRIC PROPERTIES 

In this section we borrow most of the notational conventions
and definitions from Deza and Laurent [40]. We
refer to this book, to Berg, Christensen and Ressel [48],
and to Blumenthal [UJ for extensive introductions to the
used results. Like Berg, Christensen and Ressel [15] we
shall use the expressions "positive and negative definite"
for what most textbooks would call "positive and negative
semi-definite".

Definition 1. For a set $X$, a function $d : X \times X \to \mathbb{R}$
is called a distance if for every $x, y \in X$:

1. $d(x,y) \ge 0$ with equality if $x = y$.

2. $d$ is symmetric: $d(x,y) = d(y,x)$.

The pair $(X,d)$ is then called a distance space. If
in addition to 1 and 2, for every triple $x, y, z \in X$,
the function $d$ satisfies

3. $d(x, y) + d(x, z) \ge d(y, z)$ (the triangle inequality),

then $d$ is called a pseudometric and $(X, d)$ a pseudometric
space. If also $d(x, y) = 0$ holds if and only
if $x = y$, then we speak of a metric and a metric
space.



Our techniques to prove our embeddability results for
$\mathrm{JD}_\alpha$ and $\mathrm{QJD}_\alpha$ are somewhat indirect. To provide some
intuition, we briefly mention the following facts. Only
Definition 1, Proposition 1 and Theorem 3 are needed
for our proofs.

Work of Cayley and Menger gives a characterization of
$\ell_2$ embeddability of a distance space in terms of Cayley-Menger
determinants. Given a finite distance space
$(X,d)$, the Cayley-Menger matrix $\mathrm{CM}(X, d)$ is given in
terms of the matrix $D_{ij} = d(x_i,x_j)$, for $x_i,x_j \in X$, and
the all-ones vector $e$:

$$\mathrm{CM}(X,d) := \begin{pmatrix} D & e \\ e^T & 0 \end{pmatrix}.$$

Menger proved the following relation between $\ell_2$ embeddability
and the determinant of $\mathrm{CM}(X,d)$.

Proposition 1 ([42]). Let $(X,d)$ be a finite distance
space. Then $(X, d^{1/2})$ is $\ell_2$ embeddable if and only if
for every $Y \subseteq X$, we have $(-1)^{|Y|}\det\mathrm{CM}(Y,d) \ge 0$.

As an example, consider a distance space with $|X| = 3$.
If we set $a := d(x_1, x_2)^{1/2}$, $b := d(x_1, x_3)^{1/2}$ and
$c := d(x_2, x_3)^{1/2}$, then we obtain

$$-\det\mathrm{CM}(X, d) = -(a+b+c)(a-b-c)(-a+b-c)(-a-b+c). \tag{12}$$

On the one hand, this is at least zero if $d^{1/2}$ is a pseudometric,
since the triangle inequality makes each of the last three factors
nonpositive, and hence pseudometric spaces on three points are $\ell_2$
embeddable. On the other hand, up to a factor 1/16, the
right-hand side of (12) is the square of Heron's formula
for the area of a triangle with edge-lengths $a$, $b$ and $c$. In
general, Cayley-Menger determinants give the formulas
needed to calculate the squared hypervolumes of higher
dimensional simplices. Menger's result can thus be interpreted
as saying that a distance space $(X, d^{1/2})$ is $\ell_2$
embeddable if and only if every subset is a simplex with
real hypervolume.
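
The three-point example can be verified on a concrete triangle. The sketch below builds the bordered matrix $\begin{pmatrix} D & e \\ e^T & 0\end{pmatrix}$ for the 3-4-5 right triangle (area 6) and compares $-\det\mathrm{CM}$ with 16 times the squared area from Heron's formula. The helper names and the choice of triangle are assumptions of this example.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total

def cayley_menger(d):
    """Bordered matrix [[D, e], [e^T, 0]] for a square matrix D of squared distances."""
    n = len(d)
    return [list(row) + [1] for row in d] + [[1] * n + [0]]

# Edge lengths a = d(x1,x2)^(1/2), b = d(x1,x3)^(1/2), c = d(x2,x3)^(1/2).
a, b, c = 3, 4, 5
D = [[0, a * a, b * b],
     [a * a, 0, c * c],
     [b * b, c * c, 0]]

minus_det = -det(cayley_menger(D))
heron_16 = (a + b + c) * (-a + b + c) * (a - b + c) * (a + b - c)  # 16 * area^2

print(minus_det, heron_16)  # both equal 16 * 6^2 = 576
```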

Returning to our example with $|X| = 3$, we also have
the following implication.

Proposition 2. Let $(\{x_1, x_2, x_3\}, d)$ be a distance space.
Assume that for every $c_1, c_2, c_3 \in \mathbb{R}$ such that $c_1 + c_2 +
c_3 = 0$, the distance function $d$ satisfies

$$\sum_{i,j} c_i c_j d(x_i, x_j) \le 0, \tag{13}$$

where the summation is over all pairs $i,j \in \{1,2,3\}$.
Then $(\{x_1, x_2, x_3\}, d^{1/2})$ is $\ell_2$ embeddable.

Proof: Let $a := d(x_1, x_2)^{1/2}$, $b := d(x_1, x_3)^{1/2}$ and $c :=
d(x_2, x_3)^{1/2}$. We first show that (13) implies that $-\det\mathrm{CM}(X,d)$ in (12) is
nonnegative. To this end, set $c_1 = 1$, $c_2 = t$, $c_3 = -t - 1$,
where $t$ is a real parameter. Then, if (13) holds, we get
the inequality

$$a^2 t + b^2(-t - 1) + c^2 t(-t - 1) \le 0.$$

The left-hand side is a second order polynomial in $t$ with nonpositive
leading coefficient $-c^2$, so this inequality holds for all $t$ if and only if
its discriminant
$(a^2 - b^2 - c^2)^2 - 4b^2c^2 = (a+b+c)(a-b-c)(-a+b-c)(-a-b+c)$
is at most zero, that is, if and only if $-\det\mathrm{CM}(X,d) \ge 0$. The result now
follows from Proposition 1. ■

The basis of our positive results in this section is that,
due to Schoenberg [43] [44], a more general version of
Proposition 2 also holds. To state it concisely, we first
define negative definiteness.

Definition 2 (Negative definiteness). Let $(X,d)$ be a
distance space. Then $d$ is said to be negative definite
if and only if for all finite sets $(c_i)_{i \le n}$ of real numbers
such that $\sum_{i=1}^n c_i = 0$, and all corresponding finite sets
$(x_i)_{i \le n}$ of points in $X$, it holds that

$$\sum_{i,j} c_i c_j d(x_i, x_j) \le 0. \tag{14}$$

In this case, $(X, d)$ is said to be a distance space of negative
type.

The following theorem follows as a corollary of Schoenberg's
theorem.

Theorem 3. Let $(X, d)$ be a distance space. Then
$(X, d^{1/2})$ can be isometrically embedded in a real separable
Hilbert space if and only if $(X, d)$ is of negative
type.
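
Negative type is easy to probe numerically. The following sketch evaluates the quadratic form in (14) with $d = \mathrm{JD}_\alpha$ for a handful of randomly generated distributions and zero-sum coefficients; for $\alpha \in (0,2]$ the value should be nonpositive. The random data, the seed and the function names are assumptions of this example, and a numerical spot check of this kind is of course no substitute for a proof.

```python
import math
import random

def entropy_of_order(p, alpha):
    """S_alpha(P); Shannon entropy at alpha = 1."""
    if alpha == 1:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (1 - sum(x ** alpha for x in p if x > 0)) / (alpha - 1)

def jensen_divergence(p, q, alpha):
    """JD_alpha(P, Q) = S_alpha((P+Q)/2) - S_alpha(P)/2 - S_alpha(Q)/2."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy_of_order(mid, alpha) - entropy_of_order(p, alpha) / 2 - entropy_of_order(q, alpha) / 2

def quadratic_form(points, coeffs, alpha):
    """sum_{i,j} c_i c_j JD_alpha(P_i, P_j), the left-hand side of (14)."""
    return sum(ci * cj * jensen_divergence(pi, pj, alpha)
               for ci, pi in zip(coeffs, points)
               for cj, pj in zip(coeffs, points))

def random_distribution(n):
    weights = [random.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

random.seed(0)
points = [random_distribution(3) for _ in range(5)]
coeffs = [random.uniform(-1, 1) for _ in range(5)]
mean = sum(coeffs) / len(coeffs)
coeffs = [c - mean for c in coeffs]   # enforce sum_i c_i = 0

for alpha in (0.5, 1, 1.5, 2):
    print(alpha, quadratic_form(points, coeffs, alpha))  # expected to be <= 0
```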

Note that if isometric embedding in a Hilbert space 
is possible, then the space must be a metric space. We 
define positive definiteness as follows. 

Definition 3 (Positive definiteness). Let $X$ be a set and
$f : X \times X \to \mathbb{R}$ a mapping. Then $f$ is said to be positive
definite if and only if for all finite sets $(c_i)_{i \le n}$ of
real numbers and all corresponding finite sets $(x_i)_{i \le n}$ of
points in $X$, it holds that

$$\sum_{i,j} c_i c_j f(x_i, x_j) \ge 0. \tag{15}$$

Because we are concerned with functions defined on
convex sets, the following definition shall be useful.

Definition 4 (Exponential convexity). Let $X$ be a convex
set and $\phi : X \to \mathbb{R}$ a mapping. Then $\phi$ is said to be
exponentially convex if the function $X \times X \to \mathbb{R}$ given
by $(x, y) \mapsto \phi\left(\frac{x+y}{2}\right)$ is positive definite.

Normally exponential convexity is defined as positive
definiteness of $\phi(x + y)$ (as is done in for instance [45]),
but the definition given here allows the function $\phi$ only
to be defined on a convex set.



A. Metric properties of $\mathrm{JD}_\alpha$

With Theorem 3 we prove the following for Jensen divergence
of order $\alpha$.

Theorem 4. For $\alpha \in (0,2]$, the space $(M_+^1(\mathbb{N}), \mathrm{JD}_\alpha^{1/2})$
can be isometrically embedded in a real separable Hilbert
space.

Note that Theorem 4 implies that the same holds for
$\mathrm{QJD}_\alpha$ for sets of commuting quantum states.

We use the following lemma to prove that $\mathrm{JD}_\alpha$ is negative
definite for $\alpha \in (0, 2]$. Theorem 4 then follows from
this and Theorem 3.

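Lemma 1. For $\alpha \in (0, 1)$, we have

$$x^\alpha = \frac{1}{\Gamma(-\alpha)}\int_0^\infty \frac{e^{-xt} - 1}{t^{\alpha+1}}\,dt,$$

where $\Gamma(\alpha) = \int_0^\infty t^{\alpha-1}e^{-t}\,dt$ is the Gamma function. For
$\alpha \in (1,2)$, we have

$$x^\alpha = \frac{1}{\Gamma(-\alpha)}\int_0^\infty \frac{e^{-xt} - (1 - xt)}{t^{\alpha+1}}\,dt.$$

Proof: Let $\gamma \in (-1,0)$. From the definition of the
Gamma function, we have the following equality:

$$1 = \frac{1}{\Gamma(-\gamma)}\int_0^\infty \tau^{-\gamma-1}e^{-\tau}\,d\tau.$$

By substituting $\tau = tz$ we get

$$z^\gamma = \frac{1}{\Gamma(-\gamma)}\int_0^\infty \frac{e^{-tz}}{t^{\gamma+1}}\,dt.$$

Let $\beta \in (0, 1)$ such that $\beta = \gamma + 1$. Integrating $z^\gamma$ for $z$
from zero to $y$ and multiplying by $\gamma + 1$ gives

$$y^\beta = (\gamma + 1)\int_0^y z^\gamma\,dz = \frac{1}{\Gamma(-\beta)}\int_0^\infty \frac{e^{-yt} - 1}{t^{\beta+1}}\,dt.$$

Now let $\alpha \in (1,2)$ such that $\alpha = \beta + 1$. Integrating $y^\beta$
and multiplying by $\beta + 1$ gives the result:

$$x^\alpha = (\beta + 1)\int_0^x y^\beta\,dy = \frac{1}{\Gamma(-\alpha)}\int_0^\infty \frac{e^{-xt} - (1 - xt)}{t^{\alpha+1}}\,dt.$$

■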



Lemma 2. For $\alpha \in (0, 2]$, the distance space $(M_+^1(\mathbb{N}), \mathrm{JD}_\alpha)$
is of negative type.

Proof: Let $(c_i)_{i \le n}$ be a set of real numbers such that
$\sum_{i=1}^n c_i = 0$. For two probability distributions $P$ and $Q$,
we have

$$\mathrm{JD}_\alpha(P,Q) = S_\alpha\left(\frac{P+Q}{2}\right) - \frac{1}{2}S_\alpha(P) - \frac{1}{2}S_\alpha(Q).$$

Observe that for any real valued, single-variable function
$f$, we have $\sum_{i,j} c_i c_j f(x_i) = 0$. Hence, we only need to
prove that the function

$$(P,Q) \mapsto S_\alpha\left(\frac{P+Q}{2}\right) = \frac{1}{\alpha - 1}\left(1 - \sum_i \left(\frac{p_i + q_i}{2}\right)^\alpha\right)$$

is negative definite for all $\alpha \in (0, 2]$. From this decomposition
of $S_\alpha$ into a sum over point probabilities it follows
that we need to show that $(x,y) \mapsto -\frac{1}{\alpha-1}\left(\frac{x+y}{2}\right)^\alpha$ is negative
definite. Lemma 1 shows that for fixed $0 < \alpha < 1$ and fixed
$1 < \alpha < 2$, the mappings $x \mapsto x^\alpha$ and $x \mapsto -x^\alpha$, respectively, can be obtained as the
limit of linear combinations with positive coefficients of
functions of the type $x \mapsto 1 - e^{-tx}$ and $x \mapsto 1 - e^{-tx} - tx$
respectively. Each such function yields a negative definite kernel,
since the constant and linear terms contribute zero when $\sum_i c_i = 0$, and for non-negative real
numbers $x_1, \dots, x_n$,

$$\sum_{i,j} c_i c_j\left(-e^{-t(x_i + x_j)}\right) = -\left(\sum_{i=1}^n c_i e^{-t x_i}\right)^2 \le 0.$$

The case $\alpha = 1$ follows by continuity. The case $\alpha = 2$ also
follows by continuity, but a direct proof without Lemma 1
is straightforward. ■

Proof of Theorem 4: Follows directly from Lemma 2
and Theorem 3. ■

A constructive proof of Theorem 4 for $\mathrm{JD}_1$ (JD) is
given by Fuglede [H3 27], who uses an embedding into
a subset of a real Hilbert space defined by a logarithmic
spiral.



B. Metric properties of $\mathrm{QJD}_\alpha$ for qubits

Using the same approach as above, we prove the following
for quantum Jensen divergence of order $\alpha$ and states
on two-dimensional Hilbert spaces.

Theorem 5. For $\alpha \in (0,2]$, the space $(B_+^1(\mathcal{H}_2), \mathrm{QJD}_\alpha^{1/2})$
can be isometrically embedded in a real separable Hilbert
space.

This is established by the following lemmas and Theorem 3.

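Lemma 3. Let $(V, \langle\cdot|\cdot\rangle)$ be a real Hilbert space with norm
$\|\cdot\|_2 = \langle\cdot|\cdot\rangle^{1/2}$. Then $(V, \|\cdot\|_2^2)$ is a distance space of
negative type.

Proof: The result follows immediately if we expand the
distance function $\|\cdot\|_2^2$ in terms of the inner product. For real numbers
$c_1, \dots, c_n$ with $\sum_i c_i = 0$,

$$\begin{aligned}
\sum_{i,j} c_i c_j \langle x_i - x_j | x_i - x_j\rangle
&= \sum_{i,j} c_i c_j\left(\|x_i\|_2^2 + \|x_j\|_2^2 - 2\langle x_i | x_j\rangle\right) \\
&= 2\sum_i c_i \sum_j c_j \|x_j\|_2^2 - 2\sum_{i,j} c_i c_j \langle x_i | x_j\rangle \\
&= -2\sum_{i,j} c_i c_j \langle x_i | x_j\rangle = -2\Bigl\|\sum_i c_i x_i\Bigr\|_2^2 \le 0.
\end{aligned}$$

■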


Lemma 4. The distance space $(B_+^1(\mathcal{H}_2), \mathrm{QJD}_\alpha)$, $\alpha \in
(0, 2]$, is of negative type.

Proof: Using the same techniques as in the proof of Theorem 4
and the fact that Lemma 1 also holds when $x$ is
a matrix, what has to be shown is that for $\rho \in B_+^1(\mathcal{H}_2)$,
the function $\rho \mapsto \mathrm{Tr}(\exp(-t\rho))$ is exponentially convex.
Since $\rho$ acts on a two-dimensional Hilbert space, it has
only two eigenvalues, $\lambda_+$ and $\lambda_-$, that satisfy $\lambda_+ + \lambda_- = 1$
and $\lambda_+^2 + \lambda_-^2 = \mathrm{Tr}(\rho^2)$. A straightforward calculation
gives

$$\lambda_\pm = \frac{1}{2} \pm \frac{\left(2\mathrm{Tr}(\rho^2) - 1\right)^{1/2}}{2}. \tag{16}$$

Plugging this into $\mathrm{Tr}(\exp(-t\rho))$ gives

$$\begin{aligned}
\mathrm{Tr}\left(e^{-t\rho}\right) &= 2e^{-t/2}\cosh\left(\frac{t}{2}\left(2\mathrm{Tr}(\rho^2) - 1\right)^{1/2}\right) \\
&= 2e^{-t/2}\sum_{k=0}^\infty \frac{t^{2k}}{(2k)!\,4^k}\left(2\mathrm{Tr}(\rho^2) - 1\right)^k,
\end{aligned}$$

where the second equality follows from the Taylor expansion
of hyperbolic cosine. The task can thus be reduced
to proving that $\left(2\mathrm{Tr}(\rho^2) - 1\right)^k$ is exponentially convex
for all $k \ge 0$. For this we can use the following theorem:

Theorem 6 ([48], slight reformulation of Theorem 1.12).
Let $\phi_1, \phi_2 : X \to \mathbb{C}$ be exponentially convex functions.
Then $\phi_1 \cdot \phi_2$ is exponentially convex too.

This implies that proving it for $k = 1$ suffices. The
Hilbert-Schmidt distance of two density matrices is defined as the
Hilbert-Schmidt norm $\|\cdot\|_2$ of their difference. Since the
Hilbert-Schmidt norm is a Hilbert-space metric, Lemma 3
implies that $(\rho_1,\rho_2) \mapsto \|\rho_1 - \rho_2\|_2^2$ is negative definite
and the equality

$$\|\rho_1 - \rho_2\|_2^2 = \mathrm{Tr}(\rho_1 - \rho_2)^2 = 2\left(\mathrm{Tr}\rho_1^2 + \mathrm{Tr}\rho_2^2\right) - \mathrm{Tr}\left((\rho_1 + \rho_2)^2\right)$$

implies that the function $\mathrm{Tr}\left((\rho_1 + \rho_2)^2\right)$ is positive definite.
From this it follows that the function $2\mathrm{Tr}(\rho^2) - 1$
is exponentially convex.
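
The closed form for $\mathrm{Tr}(e^{-t\rho})$ obtained from Eq. (16) can be checked against a direct power-series evaluation. The sketch below does this for one real symmetric qubit density matrix; the matrix, the value of $t$ and the function names are arbitrary choices made for this example.

```python
import math

def matmul2(x, y):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def trace_exp_series(rho, t, terms=40):
    """Tr exp(-t rho) from the power series sum_k (-t)^k Tr(rho^k) / k!."""
    power = [[1.0, 0.0], [0.0, 1.0]]   # rho^0
    total = 0.0
    for k in range(terms):
        total += (-t) ** k * (power[0][0] + power[1][1]) / math.factorial(k)
        power = matmul2(power, rho)
    return total

def purity(rho):
    """Tr(rho^2) for a real symmetric 2x2 matrix."""
    return sum(rho[i][j] * rho[j][i] for i in range(2) for j in range(2))

def eigenvalues_from_purity(p):
    """Eq. (16): lambda_+/- = 1/2 +/- (2 Tr(rho^2) - 1)^(1/2) / 2."""
    root = math.sqrt(2 * p - 1)
    return (0.5 + root / 2, 0.5 - root / 2)

def trace_exp_closed_form(rho, t):
    """2 exp(-t/2) cosh((t/2) (2 Tr(rho^2) - 1)^(1/2))."""
    return 2 * math.exp(-t / 2) * math.cosh(0.5 * t * math.sqrt(2 * purity(rho) - 1))

rho = [[0.7, 0.2], [0.2, 0.3]]   # trace 1, determinant 0.17 > 0, hence a valid state
t = 1.5

lam_plus, lam_minus = eigenvalues_from_purity(purity(rho))
print(lam_plus + lam_minus, lam_plus * lam_minus)            # trace and determinant of rho
print(trace_exp_series(rho, t), trace_exp_closed_form(rho, t))  # should agree
```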



Proof of Theorem 5: Follows directly from Lemma 4
and Theorem 3. ■



C. Metric properties of $\mathrm{QJD}_\alpha$ for pure-states

Here we prove that $\mathrm{QJD}_\alpha$ is the square of a metric
when restricted to pairs of pure-states. For a Hilbert
space of dimension $d$ we denote the set of pure-states as
$P(\mathcal{H}_d)$.

Theorem 7. For $\alpha \in (0,2]$, the space $(P(\mathcal{H}_d), \mathrm{QJD}_\alpha^{1/2})$
can be isometrically embedded in a real separable Hilbert
space.

Lemma 5. The distance space $(P(\mathcal{H}_d), \mathrm{QJD}_\alpha)$, $\alpha \in
(0, 2]$, is of negative type.

Proof: Using the same techniques as in Theorem 4
we have to prove that for $\rho \in P(\mathcal{H}_d)$, the function
$\rho \mapsto \mathrm{Tr}(\exp(-t\rho))$ is exponentially convex. For $\rho_1, \rho_2 \in
P(\mathcal{H}_d)$ such that $\rho_1 \ne \rho_2$, the matrix $\rho_1 + \rho_2$ has two non-zero
eigenvalues, $\lambda_+$ and $\lambda_-$, which can be calculated in
the same way as above. In this case (16) reduces to

$$\lambda_\pm = 1 \pm \left(\mathrm{Tr}(\rho_1 \cdot \rho_2)\right)^{1/2}.$$

When we plug this into $\mathrm{Tr}(\exp(-t(\rho_1 + \rho_2)))$, we get

$$\begin{aligned}
\mathrm{Tr}\left(e^{-t(\rho_1 + \rho_2)}\right) &= (d - 2) + 2e^{-t}\cosh\left(t\left(\mathrm{Tr}(\rho_1 \cdot \rho_2)\right)^{1/2}\right) \\
&= (d - 2) + 2e^{-t}\sum_{k=0}^\infty \frac{t^{2k}}{(2k)!}\left(\mathrm{Tr}(\rho_1 \cdot \rho_2)\right)^k,
\end{aligned}$$

where the $(d - 2)$ term comes from the fact that $d - 2$
of the eigenvalues are zero. We need to prove that
$(\rho_1, \rho_2) \mapsto \left(\mathrm{Tr}(\rho_1 \cdot \rho_2)\right)^k$ is positive definite for all integers
$k \ge 0$. But Theorem 6 implies that we only need
to prove it for $k = 1$. Appealing to the Hilbert-Schmidt distance,
we have

$$\|\rho_1 - \rho_2\|_2^2 = \mathrm{Tr}\rho_1^2 + \mathrm{Tr}\rho_2^2 - 2\mathrm{Tr}(\rho_1 \cdot \rho_2).$$

Since, by Lemma 3, this is negative definite, the result
follows. ■



Proof of Theorem 7: Follows directly from Lemma 5
and Theorem 3. ■



D. Counter examples

1. Metric space counter example for $\alpha \in (2, 3)$.



To see that $\mathrm{JD}_\alpha$, and hence $\mathrm{QJD}_\alpha$, is not the square of a metric for all $\alpha$, we check the triangle inequality for the three probability vectors $P = (0, 1)$, $Q = (1/2, 1/2)$ and $R = (1, 0)$. We have

$$\mathrm{JD}_\alpha(P,Q) = \mathrm{JD}_\alpha(Q,R) = S_\alpha(1/4, 3/4) - \frac{S_\alpha(1/2, 1/2)}{2}$$

and

$$\mathrm{JD}_\alpha(P,R) = S_\alpha(1/2, 1/2).$$

The triangle inequality is equivalent to the inequality

$$\begin{aligned}
0 &\geq -2\,\mathrm{JD}_\alpha(P,Q) - 2\,\mathrm{JD}_\alpha(Q,R) + \mathrm{JD}_\alpha(P,R) \\
&= -4\left(S_\alpha(1/4, 3/4) - \frac{S_\alpha(1/2, 1/2)}{2}\right) + S_\alpha(1/2, 1/2) \\
&= 3\,S_\alpha(1/2, 1/2) - 4\,S_\alpha(1/4, 3/4) \\
&= 3\,\frac{1 - 2(1/2)^\alpha}{\alpha - 1} - 4\,\frac{1 - (1/4)^\alpha - (3/4)^\alpha}{\alpha - 1} \\
&= \frac{4(1/4)^\alpha + 4(3/4)^\alpha - 6(1/2)^\alpha - 1}{\alpha - 1}.
\end{aligned}$$

We make the substitution $x = (1/2)^\alpha$ and assume $\alpha > 1$, so the inequality is equivalent to

$$4x^2 + 4x^{2 - \frac{\ln 3}{\ln 2}} - 6x - 1 \leq 0.$$

Define the function

$$f(x) = 4x^2 + 4x^{2 - \frac{\ln 3}{\ln 2}} - 6x - 1.$$

Then its first and second derivatives are given by

$$f'(x) = 8x + 4\left(2 - \frac{\ln 3}{\ln 2}\right)x^{1 - \frac{\ln 3}{\ln 2}} - 6,$$

$$f''(x) = 8 + 4\left(2 - \frac{\ln 3}{\ln 2}\right)\left(1 - \frac{\ln 3}{\ln 2}\right)x^{-\frac{\ln 3}{\ln 2}},$$

and we see that $f''(x) = 0$ has exactly one solution. Therefore $f$ has exactly one inflection point and the equation $f(x) = 0$ has at most three solutions. Therefore the equation

$$4(1/4)^\alpha + 4(3/4)^\alpha - 6(1/2)^\alpha - 1 = 0$$

has at most three solutions. It is straightforward to check that $\alpha = 1$, $\alpha = 2$ and $\alpha = 3$ are solutions, so these are the only ones. Therefore the sign of

$$\frac{4(1/4)^\alpha + 4(3/4)^\alpha - 6(1/2)^\alpha - 1}{\alpha - 1}$$

is constant in the interval $(2, 3)$, and plugging in any number (for example $\alpha = 5/2$) shows that it is positive in this interval, so the triangle inequality is violated there. Hence $\mathrm{JD}_\alpha$ cannot be a square of a metric for $\alpha \in (2, 3)$.
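The counterexample can be reproduced numerically. The sketch below assumes the Tsallis form $S_\alpha(P) = (1 - \sum_i p_i^\alpha)/(\alpha - 1)$ that appears in the computation above, with $\alpha \neq 1$; the helper names `tsallis`, `jd` and `triangle_gap` are illustrative.

```python
import math

def tsallis(p, alpha):
    """Tsallis entropy S_alpha(P) = (1 - sum p_i^alpha) / (alpha - 1), alpha != 1."""
    return (1.0 - sum(x ** alpha for x in p)) / (alpha - 1.0)

def jd(p, q, alpha):
    """Jensen divergence of order alpha between two probability vectors."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return tsallis(m, alpha) - 0.5 * tsallis(p, alpha) - 0.5 * tsallis(q, alpha)

def triangle_gap(alpha):
    """d(P,R) - d(P,Q) - d(Q,R) with d = sqrt(JD_alpha); positive means violation."""
    P, Q, R = (0.0, 1.0), (0.5, 0.5), (1.0, 0.0)
    d = lambda a, b: math.sqrt(jd(a, b, alpha))
    return d(P, R) - d(P, Q) - d(Q, R)

if __name__ == "__main__":
    for alpha in (1.5, 2.25, 2.5, 2.75):
        print(alpha, triangle_gap(alpha))
```

A positive gap for a value of $\alpha$ in $(2, 3)$ and a negative gap for $\alpha = 1.5$ is what the argument above predicts; the values $\alpha = 2$ and $\alpha = 3$ are avoided because the gap is exactly zero there and rounding would dominate.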



2. Counter examples for Hilbert space embeddability for $\alpha \in (7/2, \infty)$

In the previous paragraph we showed that $\mathrm{JD}_\alpha$ and $\mathrm{QJD}_\alpha$ are not the squares of metric functions for $\alpha \in (2, 3)$. Hence, for $\alpha$ in this interval, Hilbert space embeddings are not possible. Here we prove a weaker result for $\alpha \in (7/2, \infty)$, using the Cayley-Menger determinant.



Theorem 8. The space $\left(M_+^1(n), \mathrm{JD}_\alpha^{1/2}\right)$ is not Hilbert space embeddable for $\alpha$ in the interval $(7/2, \infty)$.

Note that this does not exclude the possibility that $\mathrm{JD}_\alpha$ is the square of a metric and that the same result holds for $\mathrm{QJD}_\alpha$.



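Proof: Consider the four distributions

$$\left(\tfrac12 - 3\varepsilon,\ \tfrac12 + 3\varepsilon\right),\quad \left(\tfrac12 - \varepsilon,\ \tfrac12 + \varepsilon\right),\quad \left(\tfrac12 + \varepsilon,\ \tfrac12 - \varepsilon\right),\quad \left(\tfrac12 + 3\varepsilon,\ \tfrac12 - 3\varepsilon\right).$$

Then the Cayley-Menger determinant is

$$\mathrm{CM} = \begin{vmatrix}
s_\alpha\left(\tfrac12 - 3\varepsilon\right) & s_\alpha\left(\tfrac12 - 2\varepsilon\right) & s_\alpha\left(\tfrac12 - \varepsilon\right) & s_\alpha\left(\tfrac12\right) & 1 \\
s_\alpha\left(\tfrac12 - 2\varepsilon\right) & s_\alpha\left(\tfrac12 - \varepsilon\right) & s_\alpha\left(\tfrac12\right) & s_\alpha\left(\tfrac12 + \varepsilon\right) & 1 \\
s_\alpha\left(\tfrac12 - \varepsilon\right) & s_\alpha\left(\tfrac12\right) & s_\alpha\left(\tfrac12 + \varepsilon\right) & s_\alpha\left(\tfrac12 + 2\varepsilon\right) & 1 \\
s_\alpha\left(\tfrac12\right) & s_\alpha\left(\tfrac12 + \varepsilon\right) & s_\alpha\left(\tfrac12 + 2\varepsilon\right) & s_\alpha\left(\tfrac12 + 3\varepsilon\right) & 1 \\
1 & 1 & 1 & 1 & 0
\end{vmatrix}$$

and if the four points are Hilbert space embeddable then this determinant is non-negative. The function $\varepsilon \mapsto s_\alpha\left(\tfrac12 + \varepsilon\right)$ has a Taylor expansion given by

$$s_\alpha\left(\tfrac12 + \varepsilon\right) = s_\alpha\left(\tfrac12\right) + \frac{s_\alpha''\left(\tfrac12\right)}{2}\varepsilon^2 + \frac{s_\alpha^{(4)}\left(\tfrac12\right)}{24}\varepsilon^4 + \frac{s_\alpha^{(6)}\left(\tfrac12\right)}{720}\varepsilon^6 + \varepsilon^8 f(\varepsilon),$$

where $f$ is some continuous function of $\varepsilon$. This can be used to get the expansion of the Cayley-Menger determinant: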

2 



1 us I 1 



CM= -s^ 



8 a \2 I V V a \2 



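for some continuous function $g$ [52]. We have the following formula for the even derivatives of $s_\alpha$:

$$s_\alpha^{(2n)}(x) = -\frac{\alpha(\alpha-1)\cdots(\alpha-2n+1)}{\alpha-1}\left(x^{\alpha-2n} + (1-x)^{\alpha-2n}\right)$$

and

$$s_\alpha^{(2n)}\left(\tfrac12\right) = -\frac{\alpha(\alpha-1)\cdots(\alpha-2n+1)}{\alpha-1}\,2^{2n+1-\alpha}.$$

If the Cayley-Menger determinant is positive for all small $\varepsilon$ then

$$\left(s_\alpha^{(4)}\left(\tfrac12\right)\right)^2 - s_\alpha''\left(\tfrac12\right)s_\alpha^{(6)}\left(\tfrac12\right) \leq 0$$

or equivalently

$$\left(-\alpha(\alpha-2)(\alpha-3)\,2^{5-\alpha}\right)^2 - \left(-\alpha\,2^{3-\alpha}\right)\left(-\alpha(\alpha-2)(\alpha-3)(\alpha-4)(\alpha-5)\,2^{7-\alpha}\right) \leq 0$$

and

$$\begin{aligned}
0 &\geq \left(\alpha(\alpha-2)(\alpha-3)\right)^2 - \alpha\cdot\alpha(\alpha-2)(\alpha-3)(\alpha-4)(\alpha-5) \\
&= \alpha^2(\alpha-2)(\alpha-3)\left((\alpha-2)(\alpha-3) - (\alpha-4)(\alpha-5)\right) \\
&= 4\alpha^2(\alpha-2)(\alpha-3)\left(\alpha - \tfrac72\right).
\end{aligned}$$

Hence, the Cayley-Menger determinant is non-negative only for $\alpha$ in the intervals $[0, 2]$ and $[3, \tfrac72]$. ■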
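The final factorization can be verified with exact rational arithmetic. In the sketch below, `c(n, alpha)` is shorthand introduced here (not the paper's notation) for the product $\alpha(\alpha-2)(\alpha-3)\cdots(\alpha-2n+1)$, so that $s_\alpha^{(2n)}(1/2) = -c(n,\alpha)\,2^{2n+1-\alpha}$ under the Tsallis form of $s_\alpha$. Both sides are polynomials of degree 6 in $\alpha$, so agreement at more than six points establishes the identity.

```python
from fractions import Fraction

def c(n, alpha):
    """alpha * (alpha - 2) * (alpha - 3) * ... * (alpha - 2n + 1)."""
    prod = alpha
    for j in range(2, 2 * n):
        prod *= (alpha - j)
    return prod

def lhs(alpha):
    """c_2^2 - c_1 * c_3, i.e. the inequality after dividing by 2^(10 - 2 alpha)."""
    return c(2, alpha) ** 2 - c(1, alpha) * c(3, alpha)

def rhs(alpha):
    """The factorized form 4 alpha^2 (alpha - 2)(alpha - 3)(alpha - 7/2)."""
    return 4 * alpha ** 2 * (alpha - 2) * (alpha - 3) * (alpha - Fraction(7, 2))

if __name__ == "__main__":
    for k in range(0, 21):
        a = Fraction(k, 4)
        print(a, lhs(a) == rhs(a), rhs(a) <= 0)
```

The sign column shows where the expression is non-positive, which for $\alpha > 0$ happens exactly on $(0, 2]$ and $[3, 7/2]$.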



V. RELATION TO TOTAL VARIATION AND 
TRACE DISTANCE 

The results of Section IV indicate that interesting geometric properties are associated with $\mathrm{JD}_\alpha$ and $\mathrm{QJD}_\alpha$ when $\alpha \in (0, 2]$.



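A. Bounds on $\mathrm{JD}_\alpha$

For $\alpha \in (0, 2]$, we bound $\mathrm{JD}_\alpha$ as follows:

Theorem 9. Let $P$ and $Q$ be probability distributions in $M_+^1(n)$, and let

$$v := \sum_i |p_i - q_i| \in [0, 2]$$

denote their total variation. Then for $\alpha \in (0, 2]$, we have $L \leq \mathrm{JD}_\alpha(P,Q) \leq U$, where:

• For every $n \geq 2$, $L$ is given by

$$L(P,Q) = s_\alpha\left(\tfrac12\right) - s_\alpha\left(\tfrac12 + \tfrac{v}{4}\right). \qquad (18)$$

• For every $n \geq 3$, $U$ is given by

$$U_n(P,Q) = \frac{1}{\alpha-1}\left(\frac12 - \frac{1}{2^\alpha}\right)\|P - Q\|_\alpha^\alpha. \qquad (19)$$

• For $n = 2$, $U$ is given by the tighter quantity

$$U_2(P,Q) = s_\alpha\left(\tfrac{v}{4}\right) - \tfrac12\,s_\alpha\left(\tfrac{v}{2}\right). \qquad (20)$$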



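Proof: We start with the lower bound. Let $\sigma$ denote a permutation of the elements in $[n]$ and let $\sigma(P)$ denote the probability vector where the point probabilities have been permuted according to $\sigma$. Clearly, the function $\mathrm{JD}_\alpha$ is invariant under such permutations of its arguments:

$$\mathrm{JD}_\alpha(\sigma(P), \sigma(Q)) = \mathrm{JD}_\alpha(P, Q). \qquad (21)$$

Let $B$ denote the set of permutations $\sigma$ that satisfy

$$p_i > q_i \iff p_{\sigma(i)} > q_{\sigma(i)}$$

for all $i \in [n]$. Then, by the joint convexity of $\mathrm{JD}_\alpha$ for $\alpha \in [1, 2]$ (as proved in [10]), we have

$$\mathrm{JD}_\alpha(P, Q) = \frac{1}{|B|}\sum_{\sigma \in B}\mathrm{JD}_\alpha(\sigma(P), \sigma(Q)) \geq \mathrm{JD}_\alpha\left(\frac{1}{|B|}\sum_{\sigma \in B}\sigma(P),\ \frac{1}{|B|}\sum_{\sigma \in B}\sigma(Q)\right). \qquad (22)$$

The distributions $\frac{1}{|B|}\sum_{\sigma \in B}\sigma(P)$ and $\frac{1}{|B|}\sum_{\sigma \in B}\sigma(Q)$ have the property that they are constant on two complementary sets, namely $\{i \in [n] \mid p_i > q_i\}$ and $\{i \in [n] \mid p_i \leq q_i\}$. Therefore, we may without loss of generality assume that $P$ and $Q$ are distributions on a two-element set. On a two-element set $P$ and $Q$ can be parametrized by $P = (p, 1-p)$ and $Q = (q, 1-q)$. If $\sigma_2$ denotes the transposition of the two elements then

$$v = V\left(\frac{P + \sigma_2(Q)}{2},\ \frac{Q + \sigma_2(P)}{2}\right) = 2|p - q|.$$

By (21) and (22) we get

$$\begin{aligned}
\mathrm{JD}_\alpha(P, Q) &\geq \mathrm{JD}_\alpha\left(\frac{P + \sigma_2(Q)}{2},\ \frac{Q + \sigma_2(P)}{2}\right) \\
&= \mathrm{JD}_\alpha\left(\left(\tfrac12 + \tfrac{v}{4},\ \tfrac12 - \tfrac{v}{4}\right),\ \left(\tfrac12 - \tfrac{v}{4},\ \tfrac12 + \tfrac{v}{4}\right)\right) \\
&= s_\alpha\left(\tfrac12\right) - s_\alpha\left(\tfrac12 + \tfrac{v}{4}\right)
\end{aligned}$$

and this lower bound is attained for two distributions on a two element set. Next we derive the general upper bound. Define distribution $\tilde P$ on $[n] \times [3]$ such that for every $i \in [n]$,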

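$$\tilde P(i,1) = \min\{p_i, q_i\}, \qquad
\tilde P(i,2) = \begin{cases} p_i - q_i & \text{if } p_i > q_i \\ 0 & \text{otherwise,} \end{cases} \qquad
\tilde P(i,3) = 0,$$

and similarly define $\tilde Q$ on $[n] \times [3]$ by

$$\tilde Q(i,1) = \min\{p_i, q_i\}, \qquad
\tilde Q(i,2) = 0, \qquad
\tilde Q(i,3) = \begin{cases} q_i - p_i & \text{if } q_i > p_i \\ 0 & \text{otherwise.} \end{cases}$$

With these definitions we have $V(\tilde P, \tilde Q) = V(P, Q)$. Using the data processing inequality and the definitions of $\tilde P$ and $\tilde Q$ it is straightforward to verify that

$$\mathrm{JD}_\alpha(P, Q) \leq \mathrm{JD}_\alpha(\tilde P, \tilde Q) = \frac{1}{\alpha-1}\left(\frac12 - \frac{1}{2^\alpha}\right)\sum_i |p_i - q_i|^\alpha.$$

This upper bound is attained on a three element set so we have

$$U_n(P,Q) = \frac{1}{\alpha-1}\left(\frac12 - \frac{1}{2^\alpha}\right)\|P - Q\|_\alpha^\alpha.$$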
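The lifting construction can be illustrated with a short numerical sketch. It assumes the Tsallis form $S_\alpha(P) = (1 - \sum_i p_i^\alpha)/(\alpha-1)$ with $\alpha \neq 1$, and flattens $[n] \times [3]$ into a list of length $3n$; the names `lift` and `upper_bound` are illustrative and not from the paper.

```python
def tsallis(p, alpha):
    """Tsallis entropy of order alpha (alpha != 1)."""
    return (1.0 - sum(x ** alpha for x in p)) / (alpha - 1.0)

def jd(p, q, alpha):
    """Jensen divergence of order alpha."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return tsallis(m, alpha) - 0.5 * tsallis(p, alpha) - 0.5 * tsallis(q, alpha)

def lift(p, q):
    """Build the lifted pair on [n] x [3]: shared mass, P-excess, Q-excess."""
    pt, qt = [], []
    for a, b in zip(p, q):
        m = min(a, b)
        pt += [m, max(a - b, 0.0), 0.0]
        qt += [m, 0.0, max(b - a, 0.0)]
    return pt, qt

def upper_bound(p, q, alpha):
    """(1/(alpha-1)) * (1/2 - 2^-alpha) * sum |p_i - q_i|^alpha."""
    norm = sum(abs(a - b) ** alpha for a, b in zip(p, q))
    return (0.5 - 2.0 ** (-alpha)) / (alpha - 1.0) * norm

if __name__ == "__main__":
    P, Q, alpha = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5], 1.5
    Pt, Qt = lift(P, Q)
    print(jd(P, Q, alpha), jd(Pt, Qt, alpha), upper_bound(P, Q, alpha))
```

In the printed triple, the second and third numbers should coincide (the divergence of the lifted pair equals the closed form), and the first should not exceed them.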



To get a tight upper bound on a two-element set a special analysis is needed. The cases $p > q$ and $p < q$ are treated separately, but the two cases work the same way. We will therefore assume that $p > q$. On a two-element set parametrize $P$ and $Q$ by $P = (p, 1-p)$ and $Q = (q, 1-q)$. In this case we have the linear constraint $p - q = v/2$. For a fixed value of $v$, we have that $\mathrm{JD}_\alpha$ is a convex function of $q$. Therefore the maximum is attained by an extreme point, i.e. a distribution where either $p$ or $q$ is either 0 or 1. Without loss of generality we may assume that $q = 0$ and that $p = v/2$. This gives

$$U_2(P,Q) = s_\alpha\left(\tfrac{v}{4}\right) - \tfrac12\,s_\alpha\left(\tfrac{v}{2}\right).$$
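On a two-element set the lower bound and the tighter upper bound can be checked by random sampling. This is a sketch under the assumption that $s_\alpha(x) = (1 - x^\alpha - (1-x)^\alpha)/(\alpha-1)$ with $\alpha \in (1, 2]$; it is restricted to $n = 2$, and the function names are hypothetical.

```python
import random

def s(x, alpha):
    """Binary Tsallis entropy s_alpha(x) = S_alpha(x, 1 - x), alpha != 1."""
    return (1.0 - x ** alpha - (1.0 - x) ** alpha) / (alpha - 1.0)

def jd2(p, q, alpha):
    """JD_alpha between P = (p, 1-p) and Q = (q, 1-q)."""
    return s((p + q) / 2.0, alpha) - 0.5 * s(p, alpha) - 0.5 * s(q, alpha)

def lower(v, alpha):
    """L = s_alpha(1/2) - s_alpha(1/2 + v/4)."""
    return s(0.5, alpha) - s(0.5 + v / 4.0, alpha)

def upper2(v, alpha):
    """U_2 = s_alpha(v/4) - s_alpha(v/2) / 2."""
    return s(v / 4.0, alpha) - 0.5 * s(v / 2.0, alpha)

def check(alpha, trials=1000, seed=0, tol=1e-12):
    """Return True if L <= JD_alpha <= U_2 holds on all sampled pairs."""
    rng = random.Random(seed)
    for _ in range(trials):
        p, q = rng.random(), rng.random()
        v = 2.0 * abs(p - q)  # total variation on a two-element set
        d = jd2(p, q, alpha)
        if not (lower(v, alpha) - tol <= d <= upper2(v, alpha) + tol):
            return False
    return True

if __name__ == "__main__":
    for alpha in (1.25, 1.5, 2.0):
        print(alpha, check(alpha))
```

For $\alpha = 2$ all three quantities reduce to $v^2/8$, so the bounds hold with equality up to rounding, which is why a small tolerance is used.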



It is now straightforward to determine the exact form of the joint range of $V$ and $\mathrm{JD}_\alpha$.

Corollary 10. The joint range of $V$ and $\mathrm{JD}_\alpha$, denoted by $A_n$, is a compact region in the plane bounded by a (Jordan) curve composed of two curves: The first curve is given by (18) with $v$ running from 2 to 0. For $n = 2$ the second curve is given by (20) with $v$ running from 0 to 2, and for $n \geq 3$ the second curve is given by (19) with $v$ running from 0 to 2.
Proof: Assume first that $n \geq 3$. By Theorem 9 we know that $A_n$ is contained in the compact domain described. A continuous deformation of the lower curve into the upper bounding curve (i.e. a homotopy from the lower bounding curve to the upper bounding curve) is given by $P_t$, $Q_t$ for $t \in [0, 1]$, where



(«) = (I"*) 2 



2+v 2-v 



2%v 







I f 
I i 



for $v \in [0, 2]$. Therefore, $A_n$ has no "holes". The case $n = 2$ is handled in a similar way.



Figure 1: $V$/$\mathrm{JD}_\alpha$-diagram for $\alpha = 1$ and $n \geq 3$ (the shaded region), and for $n = 2$ (the region obtained by replacing the upper bounding curve by the dotted curve).

In Figure 1 we have depicted the $V$/$\mathrm{JD}_\alpha$-diagram for $\alpha = 1$.



The bounds (18) and (19) give us the following proposition regarding the topology induced by $(\mathrm{JD}_\alpha)^{1/2}$. In the limiting case $\alpha \to 1$, this was proved in [5] by a different method.

Proposition 11. The space $\left(M_+^1(N), \mathrm{JD}_\alpha^{1/2}\right)$ is a complete, bounded metric space for $\alpha \in (0, 2]$, and the induced topology is that of convergence in total variation.



Proof: By expansion of $L(P,Q)$ given by (18), in terms of the total variation $v$, one obtains the inequality (23):

' V \ 2 3 



Taking only the first term and bounding (19), we get 
\v 2 {P,Q) < |V 2 (P,Q) 

< m a (p,Q) 

< ^(i-^HP-oiis 

< l ^V(P,Q). (24) 



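B. Bounds on $\mathrm{QJD}_\alpha$

With Theorem 9 we can bound $\mathrm{QJD}_\alpha$ for $\alpha \in [1, 2]$. We use the following two theorems.

Theorem 12 ([49], Theorem 3.9). Let $\mathcal{H}$ be a Hilbert space, $\rho_1, \rho_2 \in B_+^1(\mathcal{H})$ and $\mathcal{M} := \{M_i \mid i = 1, \ldots, n\}$ be a measurement on $\mathcal{H}$. Then $S(\rho_1\|\rho_2) \geq D(P_\mathcal{M}\|Q_\mathcal{M})$, where $P_\mathcal{M}, Q_\mathcal{M} \in M_+^1(n)$ and have point probabilities $P_\mathcal{M}(i) = \mathrm{Tr}(M_i\rho_1)$ and $Q_\mathcal{M}(i) = \mathrm{Tr}(M_i\rho_2)$, respectively.

Theorem 13 ([38], Theorem 9.1). Let $\mathcal{H}$ be a Hilbert space, $\rho_1, \rho_2 \in B_+^1(\mathcal{H})$ and $\mathcal{M} := \{M_i \mid i = 1, \ldots, n\}$ be a measurement on $\mathcal{H}$. Then $\|\rho_1 - \rho_2\|_1 = \max_\mathcal{M} V(P_\mathcal{M}, Q_\mathcal{M})$, where $P_\mathcal{M}, Q_\mathcal{M} \in M_+^1(n)$ and have point probabilities $P_\mathcal{M}(i) = \mathrm{Tr}(M_i\rho_1)$ and $Q_\mathcal{M}(i) = \mathrm{Tr}(M_i\rho_2)$, respectively.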

Theorem 14. For $\alpha \in (0, 2]$, for all states $\rho_1, \rho_2 \in B_+^1(\mathcal{H})$, we have



s a( 2 ) 



\Pi - P2W1 



<Qm a ( Pu p 2 ) 



Zn2„ 

<^[|P! 



^Hi- 



Proof: The lower bound is proved in the same way 
as [5H1 Theorem III. 1] , by making a reduction to the 
case of classical probability distributions by means of 
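measurements. Let $\mathcal{M}$ be a measurement that maximizes $V(P_\mathcal{M}, Q_\mathcal{M})$. Then from Theorem 13 we have $V(P_\mathcal{M}, Q_\mathcal{M}) = \|\rho_1 - \rho_2\|_1$. Theorem 12 gives us

$$\mathrm{QJD}_\alpha(\rho_1, \rho_2) \geq \frac12 D\left(P_\mathcal{M}\,\Big\|\,\frac{P_\mathcal{M} + Q_\mathcal{M}}{2}\right) + \frac12 D\left(Q_\mathcal{M}\,\Big\|\,\frac{P_\mathcal{M} + Q_\mathcal{M}}{2}\right) = \mathrm{JD}_\alpha(P_\mathcal{M}, Q_\mathcal{M}).$$

The result now follows from Theorem 9. The upper bound is proved the same way as we proved the classical bound. Introduce a 3-dimensional Hilbert space $\mathcal{G}$ with basis vectors $|1\rangle$, $|2\rangle$ and $|3\rangle$. On $\mathcal{H} \otimes \mathcal{G}$ define the density matrices

$$\tilde\rho_1 = \frac{\rho_1 + \rho_2 - |\rho_1 - \rho_2|}{2} \otimes |1\rangle\langle 1| + \frac{\rho_1 - \rho_2 + |\rho_1 - \rho_2|}{2} \otimes |2\rangle\langle 2|,$$

$$\tilde\rho_2 = \frac{\rho_2 + \rho_1 - |\rho_2 - \rho_1|}{2} \otimes |1\rangle\langle 1| + \frac{\rho_2 - \rho_1 + |\rho_1 - \rho_2|}{2} \otimes |3\rangle\langle 3|.$$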



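Let $\mathrm{Tr}_\mathcal{G}$ denote the partial trace $B_+^1(\mathcal{H} \otimes \mathcal{G}) \to B_+^1(\mathcal{H})$. Then $\mathrm{Tr}_\mathcal{G}(\tilde\rho_1) = \rho_1$ and $\mathrm{Tr}_\mathcal{G}(\tilde\rho_2) = \rho_2$. The matrices $\frac{\rho_1 - \rho_2 + |\rho_1 - \rho_2|}{2}$ and $\frac{\rho_2 - \rho_1 + |\rho_1 - \rho_2|}{2}$ are positive definite, so that

$$\begin{aligned}
\|\tilde\rho_1 - \tilde\rho_2\|_1 &= \mathrm{Tr}\left|\frac{\rho_1 - \rho_2 + |\rho_1 - \rho_2|}{2} \otimes |2\rangle\langle 2| - \frac{\rho_2 - \rho_1 + |\rho_1 - \rho_2|}{2} \otimes |3\rangle\langle 3|\right| \\
&= \mathrm{Tr}\left(\frac{\rho_1 - \rho_2 + |\rho_1 - \rho_2|}{2}\right) + \mathrm{Tr}\left(\frac{\rho_2 - \rho_1 + |\rho_1 - \rho_2|}{2}\right) \\
&= \mathrm{Tr}\,|\rho_1 - \rho_2| = \|\rho_1 - \rho_2\|_1.
\end{aligned}$$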
According to the "quantum data processing inequality' 
[4"§l Theorem 3.10] we have 

qm a ( Pl ,p 2 )<Qjr> 1 (pi,P2) 

= l Tr (^pi±ki^pA 9 m \\ ln2 

+ T^ P2 - pi+ 2 lPl - p2 ^ \3)(3\)ln2 

ln2 II II 



VI. CONCLUSIONS AND OPEN PROBLEMS 

We studied generalizations of the (general) Jensen divergence and its quantum analogue. For $\alpha \in (1, 2]$, $\mathrm{JD}_\alpha$ was proved to be the square of a metric which can be embedded in a real Hilbert space. The same was shown to hold for $\mathrm{QJD}_\alpha$ restricted to qubit states or to pure states. Both these results were derived by invoking a theorem of Schoenberg and showing that these quantities are negative definite.

Whether $(\mathrm{QJD}_\alpha)^{1/2}$ is a metric for all mixed states remains unknown. However, based on a large amount of numerical evidence, we conjecture the function $A \mapsto \mathrm{Tr}(e^A)$ to be exponentially convex for density matrices $A$. Proving this would imply that $\mathrm{QJD}_\alpha$ is negative definite for $\alpha \in (0, 2]$, and hence the square of a metric that can be embedded in a real Hilbert space.